ScreenIT Quality Assurance Results
Comparison between COVID preprint analyses with ScreenIT before and after update
The data prior to the update were taken from the latest database. Screenings with the updated version were performed via a different API that accesses only the PDFs, with no metadata from the preprint servers. Two hundred preprints were selected for the comparison, but due to bugs in the pipeline, four preprints could not be screened with the updated pipeline, leaving a total of 196 screened papers.
Sciscore Results
Compared to the previous version, the updated version more often returned “not required” for the ethics statement, for example for modeling papers. However, this also affected all downstream analyses, even in cases where statements related to e.g. randomization or attrition were actually present (Figure 1). In addition, a couple of funding statements were incorrectly detected as ethics statements; attrition had a few false positives, blinding a few false negatives, and power analysis a couple of false positives and one false negative.
rtransparent Results
All preprints in the data set had conflict of interest (COI) statements and funding statements; however, only some included these in the PDF of the manuscript (Figure 2). As the updated version screened only the PDF input, the manual assessment was likewise based only on the text in the PDF. The updated version of the pipeline missed COI statements if they had non-standard section titles (e.g. “conflicts:”, or no section title at all) or if they appeared on the first page of the manuscript.
Similarly, the updated version of the pipeline missed funding statements if they had non-standard section titles (e.g. “financial disclosure”, “financing”, “funding/support”, etc.) or if the funding information was in the acknowledgements. Statements were also sometimes missed if they appeared on the first page of the PDF (the screened text was missing the first page).
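The missed-heading problem above could in principle be addressed by broadening the heading matcher. The following is a minimal, purely illustrative sketch (the function name and pattern are not the pipeline's actual code); the title variants are taken from the observations above.

```python
import re

# Hypothetical broadened pattern covering the non-standard section titles
# that the updated pipeline missed (e.g. "conflicts:", "financial disclosure",
# "funding/support"). Illustrative only, not ScreenIT's actual detector.
HEADING_PATTERN = re.compile(
    r"^\s*(conflicts?(\s+of\s+interest)?|competing\s+interests?|"
    r"funding(/support)?|financial\s+disclosure|financing|"
    r"acknowledge?ments?)\s*:?\s*$",
    re.IGNORECASE,
)

def find_statement_headings(lines):
    """Return indices of lines that look like COI/funding section headings."""
    return [i for i, line in enumerate(lines) if HEADING_PATTERN.match(line)]
```

A matcher like this would still need to be paired with text extraction that includes the first page, since several statements were missed simply because the first page was absent from the screened text.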
The updated version of the pipeline failed to detect registration numbers in 16 cases where the previous version did. The majority (14) of these were correct calls; the exceptions were two cases where a PROSPERO registration number was cited but not found in the preprint.
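The two missed PROSPERO citations could be caught with a simple pattern match. As a hedged sketch (not the pipeline's actual logic): PROSPERO registration numbers typically have the form “CRD” followed by 11 digits, e.g. the made-up ID below.

```python
import re

# Illustrative PROSPERO ID matcher; assumes the usual "CRD" + 11 digits form.
PROSPERO_RE = re.compile(r"\bCRD\d{11}\b")

def find_prospero_ids(text):
    """Return all PROSPERO-style registration numbers found in the text."""
    return PROSPERO_RE.findall(text)
```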
limitation-recognizer Results
There were only three discrepancies between the previous and updated pipeline versions (Figure 3). In all three the updated version caught limitations that the previous version did not.
TrialIdentifier Results
The updated version of TrialIdentifier yielded several apparent false positives (Figure 4). Some of these are grant numbers or accession numbers given in supplemental tables.
JetFighter Results
The updated JetFighter version detected seven papers that the previous version did not (Figure 5). In addition, it detected, possibly falsely, the fluorescence microscopy image shown in Figure 6.
Barzooka Results
In addition to the comparison of the previous and updated versions of the pipeline (for all tools listed above), we also compared the performance of Barzooka on two different types of input: the image files individually extracted during pipeline processing vs. a folder of PDFs of the same preprints. The main difference was thus the level of analysis: figure-based (Barzooka in the pipeline) vs. page-based (stand-alone Barzooka). Two hundred papers were screened with both Barzooka versions (pipeline and stand-alone), and the cases where the two versions disagreed were manually validated.
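The discrepancy check described above can be sketched as a simple set comparison. This is a hypothetical illustration of the analysis step, not the actual comparison code; the data shapes (paper ID mapped to the set of detected figure types) are assumptions.

```python
# Hypothetical sketch: flag papers where the two Barzooka variants disagree
# on which figure types are present. Data shapes are assumed for illustration.
def find_discrepancies(pipeline_results, standalone_results):
    """Return {paper_id: (only_in_pipeline, only_in_standalone)} for every
    paper on which the two versions report different figure-type sets."""
    out = {}
    for paper in pipeline_results.keys() | standalone_results.keys():
        a = pipeline_results.get(paper, set())
        b = standalone_results.get(paper, set())
        if a != b:
            out[paper] = (a - b, b - a)
    return out
```

Papers flagged by such a comparison would then go into the manual validation step.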
Discrepancies between the two Barzooka versions regarding the presence or absence of a figure type were detected in 103 out of 200 papers, with discrepancies found for all figure types (Figure 7).
For most categories, especially “approp”, “bardot”, “dot”, and “pie”, the stand-alone version generally delivered better results (Figure 7). This is therefore the recommended way to use the tool, and applying it to separately extracted image files should be avoided. For the stand-alone version, the occasional errors in the “bar” and “approp” categories were due to proportional data not being recognized as such, or to histograms and bardots being misclassified. The stand-alone version also detected “hist” images more readily than the pipeline version, although some of these detections were false positives. There were no false negatives and only a few false positives for the “dot” and “bardot” categories; common misidentifications were dot plots with whiskers, scatter plots, and “bardots” with barely any bars visible. Similarly, many densely packed dot plots or box plots were mistaken for “violin” plots. Finally, several gene structure schematics and symbol-whisker plots with large squares were mistakenly classified as “box”.